Credit Card Users Churn Prediction¶

Problem Statement¶

Business Context¶

Thera Bank recently saw a steep decline in the number of credit card users. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave the credit card service, and understand the reasons why, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

Data Description¶

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age: Age in Years
  • Gender: Gender of the account holder
  • Dependent_count: Number of dependents
  • Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate
  • Marital_Status: Marital Status of the account holder
  • Income_Category: Annual Income Category of the account holder
  • Card_Category: Type of Card
  • Months_on_book: Period of relationship with the bank (in months)
  • Total_Relationship_Count: Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal: Total Revolving Balance on the Credit Card
  • Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
  • Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
  • Avg_Utilization_Ratio: Average Card Utilization Ratio

What Is a Revolving Balance?¶

  • If the balance of a revolving credit account is not paid in full every month, the unpaid portion carries over to the next month. That carried-over amount is called a revolving balance.
What is the Average Open to buy?¶
  • 'Open to Buy' means the amount left on your credit card to use. Now, this column represents the average of this value for the last 12 months.
What is the Average utilization Ratio?¶
  • The Avg_Utilization_Ratio represents how much of the available credit the customer used on average. This is useful for calculating credit scores.
Relation b/w Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:¶
  • ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1
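This identity can be sanity-checked numerically. The snippet below is a minimal sketch with illustrative values; on the real data the relation holds because the 12-month averages of open-to-buy and utilization are taken over the same period.

```python
# Numeric sanity check of the identity, using illustrative values.
credit_limit = 12691.0
revolving_bal = 777.0
open_to_buy = credit_limit - revolving_bal   # "open to buy" = unused credit
utilization = revolving_bal / credit_limit   # share of the limit in use

# ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio == 1
assert abs(open_to_buy / credit_limit + utilization - 1.0) < 1e-9
```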

Step 0 - Domain Knowledge & Initial Hypothesis¶

Domain Knowledge

  • Customer Attrition: The core problem is customer churn (attrition) in the credit card sector. Understanding why customers leave is crucial for retention strategies.

  • Revenue Streams: Banks earn revenue from credit cards through various fees (annual, late payment, balance transfer, etc.). The balance between usage-based and fixed fees is important.

  • Customer Behavior: The data captures a mix of demographics (age, education, income), relationship metrics (time with bank, products held), and credit card usage patterns (spending, balance, inactivity).

Potential Hypotheses

We can categorize hypotheses to explore relationships between customer attributes and attrition:

Demographic Factors:

  • Hypothesis 1: Younger customers are more likely to churn due to changing financial priorities.

  • Hypothesis 2: Customers with higher education levels might be less likely to churn due to better financial literacy.

  • Hypothesis 3: Higher income customers might be less sensitive to fees and less likely to churn.

Relationship Factors:

  • Hypothesis 4: Customers with longer relationships with the bank are less likely to churn.

  • Hypothesis 5: Customers with more products at the bank are more likely to stay due to higher engagement.

  • Hypothesis 6: Frequent contact with the bank (e.g., customer service) might reduce churn.

End goal

To build a model that can predict whether a customer is likely to become an "Attrited Customer" so that the bank can take proactive measures to retain them.

Importing necessary libraries¶

In [1]:
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --

# Data manipulation
import numpy as np
import pandas as pd
import seaborn as sns

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Scikit learn
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline

# SMOTE
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)

from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)

In [2]:
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imblearn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

Loading the dataset¶

In [3]:
df = pd.read_csv("/content/BankChurners.csv")

Data Overview¶

  • Observations
  • Sanity checks
In [4]:
# Initial first rows
df.head()
Out[4]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book ... Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 ... 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 ... 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 ... 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 ... 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 ... 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000

5 rows × 21 columns

In [5]:
# Initial last few rows
df.tail()
Out[5]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book ... Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
10122 772366833 Existing Customer 50 M 2 Graduate Single $40K - $60K Blue 40 ... 2 3 4003.000 1851 2152.000 0.703 15476 117 0.857 0.462
10123 710638233 Attrited Customer 41 M 2 NaN Divorced $40K - $60K Blue 25 ... 2 3 4277.000 2186 2091.000 0.804 8764 69 0.683 0.511
10124 716506083 Attrited Customer 44 F 1 High School Married Less than $40K Blue 36 ... 3 4 5409.000 0 5409.000 0.819 10291 60 0.818 0.000
10125 717406983 Attrited Customer 30 M 2 Graduate NaN $40K - $60K Blue 36 ... 3 3 5281.000 0 5281.000 0.535 8395 62 0.722 0.000
10126 714337233 Attrited Customer 43 F 2 Graduate Married Less than $40K Silver 25 ... 2 4 10388.000 1961 8427.000 0.703 10294 61 0.649 0.189

5 rows × 21 columns

Insight

  • Education_Level has NaN values
  • Marital_Status has NaN values
In [6]:
# Shape of dataset
df.shape
Out[6]:
(10127, 21)

Total of 21 columns and 10,127 rows.

In [7]:
# Data types
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB

Insight

  • Total 21 columns, mixed of dtypes: float64(5), int64(10), object(6) for 10,127 total entries

  • Missing values present for Education Level and Marital status which will need to be addressed

  • The CLIENTNUM column appears to be unique for each customer, confirming it's a valid identifier and can be dropped.

  • The columns Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Category, and Card_Category are of type object, which pandas commonly uses for categorical strings; these can be converted to the category dtype
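The last two notes can be verified with quick checks. This is a sketch on a toy frame that mirrors the column names; the real notebook would run the same lines on `df`.

```python
import pandas as pd

# Toy frame mirroring a subset of the real columns (illustrative values).
toy = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008, 713982108],
    "Attrition_Flag": ["Existing Customer", "Existing Customer", "Attrited Customer"],
    "Gender": ["M", "F", "M"],
})

assert toy["CLIENTNUM"].is_unique        # valid identifier -> safe to drop

cat_cols = ["Attrition_Flag", "Gender"]  # subset of the object columns listed above
toy[cat_cols] = toy[cat_cols].astype("category")
assert all(toy[c].dtype.name == "category" for c in cat_cols)
```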

Missing values & Duplicates¶

In [8]:
# Missing Values
df.isnull().sum()
Out[8]:
CLIENTNUM                      0
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
In [9]:
round(df.isnull().sum() / df.isnull().count() * 100, 2)
Out[9]:
CLIENTNUM                   0.000
Attrition_Flag              0.000
Customer_Age                0.000
Gender                      0.000
Dependent_count             0.000
Education_Level            15.000
Marital_Status              7.400
Income_Category             0.000
Card_Category               0.000
Months_on_book              0.000
Total_Relationship_Count    0.000
Months_Inactive_12_mon      0.000
Contacts_Count_12_mon       0.000
Credit_Limit                0.000
Total_Revolving_Bal         0.000
Avg_Open_To_Buy             0.000
Total_Amt_Chng_Q4_Q1        0.000
Total_Trans_Amt             0.000
Total_Trans_Ct              0.000
Total_Ct_Chng_Q4_Q1         0.000
Avg_Utilization_Ratio       0.000
dtype: float64
In [10]:
# Duplicates
df.duplicated().sum()
Out[10]:
0

Insight

  • Education_Level: 15% missing data
  • Marital_Status: 7.40% missing data
  • We will handle NaN values after splitting the data into train, validation, and test sets to avoid data leakage
  • No duplicates
  • CLIENTNUM can be dropped
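The leakage-avoiding approach mentioned above can be sketched as follows: fit the imputer on the training split only and reuse it on the held-out split. The toy frame and the `most_frequent` strategy are illustrative choices, not the notebook's final decision.

```python
# Sketch: fit the imputer on the training split only, then apply it to the
# held-out split, so no test-set information leaks into the fill values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "High School", "Graduate", np.nan, "College"],
    "Marital_Status": ["Married", "Single", np.nan, "Married", "Single", "Married"],
})

X_train, X_test = train_test_split(toy, test_size=0.33, random_state=1)

imputer = SimpleImputer(strategy="most_frequent")  # strategy is an illustrative choice
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=toy.columns)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=toy.columns)  # transform only

assert X_train_imp.isna().sum().sum() == 0
assert X_test_imp.isna().sum().sum() == 0
```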

Copy the Data¶

In [11]:
df.drop(["CLIENTNUM"], axis=1, inplace=True)
In [12]:
data = df.copy()

Data 5-Point Summary¶

In [13]:
# Data summary
data.describe(include='all').T
Out[13]:
count unique top freq mean std min 25% 50% 75% max
Attrition_Flag 10127 2 Existing Customer 8500 NaN NaN NaN NaN NaN NaN NaN
Customer_Age 10127.000 NaN NaN NaN 46.326 8.017 26.000 41.000 46.000 52.000 73.000
Gender 10127 2 F 5358 NaN NaN NaN NaN NaN NaN NaN
Dependent_count 10127.000 NaN NaN NaN 2.346 1.299 0.000 1.000 2.000 3.000 5.000
Education_Level 8608 6 Graduate 3128 NaN NaN NaN NaN NaN NaN NaN
Marital_Status 9378 3 Married 4687 NaN NaN NaN NaN NaN NaN NaN
Income_Category 10127 6 Less than $40K 3561 NaN NaN NaN NaN NaN NaN NaN
Card_Category 10127 4 Blue 9436 NaN NaN NaN NaN NaN NaN NaN
Months_on_book 10127.000 NaN NaN NaN 35.928 7.986 13.000 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.000 NaN NaN NaN 3.813 1.554 1.000 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.000 NaN NaN NaN 2.341 1.011 0.000 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.000 NaN NaN NaN 2.455 1.106 0.000 2.000 2.000 3.000 6.000
Credit_Limit 10127.000 NaN NaN NaN 8631.954 9088.777 1438.300 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.000 NaN NaN NaN 1162.814 814.987 0.000 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.000 NaN NaN NaN 7469.140 9090.685 3.000 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.000 NaN NaN NaN 0.760 0.219 0.000 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.000 NaN NaN NaN 4404.086 3397.129 510.000 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.000 NaN NaN NaN 64.859 23.473 10.000 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.000 NaN NaN NaN 0.712 0.238 0.000 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.000 NaN NaN NaN 0.275 0.276 0.000 0.023 0.176 0.503 0.999
In [14]:
data.describe()
Out[14]:
Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
count 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000 10127.000
mean 46.326 2.346 35.928 3.813 2.341 2.455 8631.954 1162.814 7469.140 0.760 4404.086 64.859 0.712 0.275
std 8.017 1.299 7.986 1.554 1.011 1.106 9088.777 814.987 9090.685 0.219 3397.129 23.473 0.238 0.276
min 26.000 0.000 13.000 1.000 0.000 0.000 1438.300 0.000 3.000 0.000 510.000 10.000 0.000 0.000
25% 41.000 1.000 31.000 3.000 2.000 2.000 2555.000 359.000 1324.500 0.631 2155.500 45.000 0.582 0.023
50% 46.000 2.000 36.000 4.000 2.000 2.000 4549.000 1276.000 3474.000 0.736 3899.000 67.000 0.702 0.176
75% 52.000 3.000 40.000 5.000 3.000 3.000 11067.500 1784.000 9859.000 0.859 4741.000 81.000 0.818 0.503
max 73.000 5.000 56.000 6.000 6.000 6.000 34516.000 2517.000 34516.000 3.397 18484.000 139.000 3.714 0.999

Insight

Categorical data¶

  • Attrition flag

    • 2 categories, top count is existing customer with 8,500 out of 10,127
  • Gender

    • 2 categories, top count is Female with 5,358 out of 10,127 (counts are fairly balanced between genders)
  • Education level

    • 6 categories, top count is Graduate customers with 3128 out of 8,608
    • Missing values of 1,519
  • Marital status

    • 3 categories, top count is married customers with 4,687 out of 9,378
    • Missing values 749
  • Income category

    • 6 categories, top count is Less than 40k customers with 3,561 out of 10,127
  • Card category

    • 4 categories, top count is Blue card customers with 9436 out of 10,127
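The Attrition_Flag counts above (8,500 Existing out of 10,127) imply a fairly imbalanced target, which is worth quantifying before modeling and motivates the imbalanced-learn imports earlier:

```python
# Quantifying the imbalance implied by the Attrition_Flag counts above.
total = 10127
existing = 8500
attrited = total - existing          # the remaining customers are attrited
print(attrited, round(attrited / total, 3))
```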

Numeric data¶

  • CLIENTNUM

    • Unique for each of the 10,127 customers (already dropped before this summary)
  • Customer_Age

    • Average age is 46 years old
    • Age ranging from 26 to 73
  • Dependent_count

    • Average customer has 2.3 dependents
    • Ranges from 0 to 5
  • Months_on_book

    • Average of 35 months since customers joined the bank
    • Ranges from 13 to 56 months (roughly 1.1 to 4.7 years)
  • Total_Relationship_Count

    • On average, customers hold about 3.8 products with the bank
    • Ranges from 1 to 6
  • Months_Inactive_12_mon

    • Average customers are inactive for 2.3 months
    • Ranges from 0 to 6 months of inactivity
  • Contacts_Count_12_mon

    • Average contact count is 2.5
    • Ranges from 0 to 6 attempts per 12 months
  • Credit_Limit

    • Average credit limit is 8,631
    • Ranges from 1438 to 34,516
    • std is higher than the mean
  • Total_Revolving_Bal
    • Average is 1162
    • Ranges from 0 to 2,517
  • Avg_Open_To_Buy
    • Average is 7,469
    • ranges from 3 to 34,516
    • std is higher than the mean
  • Total_Amt_Chng_Q4_Q1

    • Average of 0.75
    • range from 0 to 3.39
  • Total_Trans_Amt

    • Average of 4,404
    • ranges from 510 to 18,484
  • Total_Trans_Ct
    • Average 64 count
    • ranges from 10 to 139 times within 12 months
  • Total_Ct_Chng_Q4_Q1

    • Average 0.7
    • Ranges from 0 to 3.7
    • Similar to Total_Amt_Chng_Q4_Q1
  • Avg_Utilization_Ratio

    • On average, customers use about 27% of their available credit (mean utilization ratio of 0.275)
    • Ranges from 0 to 0.99

Key Insights and Further Analysis:¶

Based on the business problem, domain knowledge, data summary, and the initial hypotheses, here is what the EDA process can help us understand better:

  • Variability: There's substantial variation in customer behavior across age, credit limits, spending patterns, and relationship with the bank. This suggests we may need to segment customers for more targeted analysis.

  • Potential Risk: High average utilization ratios and a wide range of credit limits suggest some customers might be at risk of overspending or financial stress, which could contribute to churn.

  • Relationship Impact: The data doesn't directly reveal the impact of relationship length or product count on churn. Further analysis is needed to understand these relationships.

  • Outliers: Some outliers exist, especially in credit limits and transaction amounts. These should be investigated to determine if they are errors or represent specific customer segments.
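One way to investigate those outliers is the standard 1.5×IQR rule. The helper below is an illustrative sketch; it would be applied per numeric column, e.g. `iqr_outliers(data['Credit_Limit'])`.

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Return the values falling outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Toy example: 100 sits far above the rest and is flagged
print(iqr_outliers(pd.Series([1, 2, 3, 4, 100])).tolist())  # [100]
```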


Exploratory Data Analysis (EDA)¶

Having established a foundational understanding of the data and the business problem, we'll now conduct univariate and bivariate analyses. This will help us uncover relationships within the data and guide us in accurately addressing missing values before proceeding to model development.

Questions:

  1. How is the total transaction amount distributed?
  2. What is the distribution of the level of education of customers?
  3. What is the distribution of the level of income of customers?
  4. How does the change in transaction amount between Q4 and Q1 (total_ct_change_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
  5. How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
  6. What are the attributes that have a strong correlation with each other?

Functions need to be defined to carry out the Exploratory Data Analysis.¶

In [15]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [16]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [17]:
# function to plot stacked bar chart

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [18]:
# Function to plot stacked bar chart with percentages
def stacked_barplot_perc(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart with percentages.

    data: DataFrame
    predictor: Independent variable (column name)
    target: Target variable (column name)
    """

    count = data[predictor].nunique()  # Number of unique categories in predictor

    # Sorting for stacked bar chart
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(by=sorter, ascending=False)
    print(tab1)

    # Crosstab for percentage calculations
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(by=sorter, ascending=False)

    # Plotting
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.xlabel(predictor, fontsize=12)
    plt.ylabel('Percentage', fontsize=12)
    plt.title(f'Proportion of {target} by {predictor}', fontsize=14)  # Add a descriptive title
    plt.legend(title=target, loc="upper left", bbox_to_anchor=(1, 1), frameon=False)

    # Add percentage labels inside bars
    for n, x in enumerate([*tab.index.values]):  # Iterate over each bar (predictor category)
        for (proportion, y_loc) in zip(tab.loc[x], tab.loc[x].cumsum()):
            if proportion != 0:
                plt.text(
                    x=n - 0.17,
                    y=(y_loc - proportion) + (proportion / 2),  # Center the text within the bar segment
                    s=f'{proportion:.1%}',  # Format as percentage with one decimal place
                    color="white",
                    fontsize=10,
                )

    plt.show()
In [19]:
### Function to plot distributions

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Univariate Analysis¶

In [20]:
# Review data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Attrition_Flag            10127 non-null  object 
 1   Customer_Age              10127 non-null  int64  
 2   Gender                    10127 non-null  object 
 3   Dependent_count           10127 non-null  int64  
 4   Education_Level           8608 non-null   object 
 5   Marital_Status            9378 non-null   object 
 6   Income_Category           10127 non-null  object 
 7   Card_Category             10127 non-null  object 
 8   Months_on_book            10127 non-null  int64  
 9   Total_Relationship_Count  10127 non-null  int64  
 10  Months_Inactive_12_mon    10127 non-null  int64  
 11  Contacts_Count_12_mon     10127 non-null  int64  
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64  
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64  
 17  Total_Trans_Ct            10127 non-null  int64  
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(9), object(6)
memory usage: 1.5+ MB

Numeric¶

Observation on CLIENTNUM¶

All client IDs are unique and accounted for; this column needs no further analysis and has already been dropped.

Observation on Age¶

In [21]:
# Age
histogram_boxplot(data, 'Customer_Age')
In [22]:
# break down of age
labeled_barplot(data, 'Customer_Age', perc=True)

Insight

  • Roughly normal distribution of ages, with ages 26 and 65 showing higher peaks than the surrounding values

  • Highest percentage for ages 44 and 49 with 4.9%

  • Ages 66 to 73 show 0.0% (too few customers to register at one decimal place) and need further analysis

  • Outliers present on upper right side

Observation on dependent count¶

In [23]:
# Boxplot for Dependent count
histogram_boxplot(data, 'Dependent_count')
In [24]:
# dependent count breakdown
labeled_barplot(data, 'Dependent_count', perc = True)

Insight

  • Customers with 2 or 3 dependents make up more than 50% of the total
  • Bimodal, with peaks at 2 and 3 dependents and lower frequencies for other values.
  • No outliers present
  • Slightly right skewed

Observation on Months on book¶

In [25]:
# Months on book
histogram_boxplot(data, 'Months_on_book')
In [26]:
# break down on months on book
labeled_barplot(data, 'Months_on_book', perc=True)

Insight

  • Roughly uniform distribution apart from the spike at 36 months
  • Very high count at 36 months, making up 24.3% of customers
  • Outliers present on both sides, more on the lower end

Further Insight

  • Month 40 seems important, as customers begin to drop off slowly after it.

Observation on Total_Relationship_Count¶

In [27]:
## Observation on Total_Relationship_Count
histogram_boxplot(data, 'Total_Relationship_Count')
In [28]:
# breakdown by percentage
labeled_barplot(data, 'Total_Relationship_Count', perc=True)

Insight

  • No outliers present
  • Left skewed
  • Majority use 3 products with the bank
  • In total, 78% of customers use 3 to 6 products with the bank
  • Customers with 4, 5, and 6 products occur in similar percentages
  • Only 9% use a single product with the bank

Further insight

  • If customers with more products (3-6) have a lower churn rate than customers with 1-2 products, the bank could target the latter and offer them additional products.
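That follow-up can be sketched as a churn rate grouped by product count. The frame below is a toy with the same column names; the real analysis would group `data` the same way.

```python
# Sketch of the follow-up check: churn rate by number of products held.
import pandas as pd

toy = pd.DataFrame({
    "Total_Relationship_Count": [1, 1, 2, 3, 4, 5, 6, 3],
    "Attrition_Flag": ["Attrited Customer", "Existing Customer", "Attrited Customer",
                       "Existing Customer", "Existing Customer", "Existing Customer",
                       "Existing Customer", "Attrited Customer"],
})
churn_rate = (
    toy.assign(churned=toy["Attrition_Flag"].eq("Attrited Customer"))
       .groupby("Total_Relationship_Count")["churned"]
       .mean()
)
print(churn_rate)
```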

Observation on Months_Inactive_12_mon¶

In [29]:
# Observation on Months_Inactive_12_mon
histogram_boxplot(data, 'Months_Inactive_12_mon')
In [30]:
# Break down of Months
labeled_barplot(data, 'Months_Inactive_12_mon', perc=True)

Insight

  • Outliers present on both sides, more on upper right side
  • Majority of customers are inactive between 2-3 months (70%)
  • 7.3% are inactive for 4-6 months. Nothing beyond 6 months is shown.

Further insight

  • Are customers with fewer products (1-2) more inactive, and vice versa?

Observation on Contract Count 12 months¶

In [31]:
# Observation Contacts_Count_12_mon
histogram_boxplot(data, 'Contacts_Count_12_mon')
In [32]:
# break down of contact
labeled_barplot(data, 'Contacts_Count_12_mon', perc=True)

Insight

  • Similar to Months_Inactive_12_mon: most customers had 2-3 contacts, counts drop from 4 to 6 contacts, and nothing beyond 6 is shown
  • Outliers present on both sides, more on upper right side

Observation on Credit_Limit¶

In [33]:
# Observation on  Credit_Limit
histogram_boxplot(data, 'Credit_Limit')
In [34]:
# Breakdown of credit limit for customers
bin_edges = [0, 5000, 10000, 15000, 20000, 25000, 30000, 35000]
bin_labels = ['0-5k', '5k-10k', '10k-15k', '15k-20k', '20k-25k', '25k-30k', '30k+']

# Create a New Column for Bins
data['Credit_Limit_Bin'] = pd.cut(data['Credit_Limit'], bins=bin_edges, labels=bin_labels, right=False)

# Calculate Percentage per Bin and Sort
binned_data = data['Credit_Limit_Bin'].value_counts(normalize=True).mul(100).reset_index(name='Percentage').rename(columns={'index': 'Credit_Limit_Bin'}).sort_values(by='Credit_Limit_Bin')

# Plotting the Barplot
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
sns.barplot(x='Credit_Limit_Bin', y='Percentage', data=binned_data)
plt.title('Distribution of Credit Limits', fontsize=14)
plt.xlabel('Credit Limit Range', fontsize=12)
plt.ylabel('Percentage of Customers (%)', fontsize=12)
plt.show()

Insight

  • Many outliers present on upper right side
  • Right skewed distribution
  • The majority of customers (roughly 55%) fall in the 0-5k credit-limit range
  • The percentage of customers drops significantly above 5k

Further insight

  • Which customers have a 30k+ credit limit? Do they churn more or less than others?
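The 30k+ question can be answered with a row-normalized crosstab of the bin column just created against the target. A sketch with a toy frame standing in for `data`; on the real frame the call would be `pd.crosstab(data["Credit_Limit_Bin"], data["Attrition_Flag"], normalize="index")`:

```python
import pandas as pd

# Churn rate per credit-limit bin (toy frame standing in for `data`)
toy = pd.DataFrame({
    "Credit_Limit_Bin": ["0-5k", "0-5k", "30k+", "30k+"],
    "Attrition_Flag": ["Attrited Customer", "Existing Customer",
                       "Existing Customer", "Existing Customer"],
})
# normalize="index" turns each row of counts into within-bin proportions
rates = pd.crosstab(toy["Credit_Limit_Bin"], toy["Attrition_Flag"], normalize="index")
print(rates)
```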

Observation on Total_Revolving_Bal¶

In [35]:
# Observation on Total_Revolving_Bal
histogram_boxplot(data, 'Total_Revolving_Bal')

Insight

  • No outliers present
  • Right skewed
  • Majority of customers have a low revolving balance. This could be due to not using their credit limit, or using it but paying it off within the month so it does not roll over
  • Another smaller peak at the end range, this could be for those individuals with 30K+ credit limits

Observation on Avg_Open_To_Buy¶

In [36]:
# Observation on Avg_Open_To_Buy
histogram_boxplot(data, 'Avg_Open_To_Buy')

Insight

  • Many outliers present on the upper right side
  • Right skewed and longer whiskers, indicates a smaller group of customers who consistently have higher amounts of unused credit.
  • 50% of the customers have an average open-to-buy of $7,500 or less over the last 12 months.
  • A significant portion of customers have balances closer to zero or in the lower range. This indicates that most customers are using a substantial portion of their credit limit on average.

Further analysis

  • The concentration of customers with lower open-to-buy amounts suggests that many customers regularly utilize a significant portion of their available credit. This could be due to high spending habits or lower credit limits.

In [37]:
# Pull up info
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Attrition_Flag            10127 non-null  object  
 1   Customer_Age              10127 non-null  int64   
 2   Gender                    10127 non-null  object  
 3   Dependent_count           10127 non-null  int64   
 4   Education_Level           8608 non-null   object  
 5   Marital_Status            9378 non-null   object  
 6   Income_Category           10127 non-null  object  
 7   Card_Category             10127 non-null  object  
 8   Months_on_book            10127 non-null  int64   
 9   Total_Relationship_Count  10127 non-null  int64   
 10  Months_Inactive_12_mon    10127 non-null  int64   
 11  Contacts_Count_12_mon     10127 non-null  int64   
 12  Credit_Limit              10127 non-null  float64 
 13  Total_Revolving_Bal       10127 non-null  int64   
 14  Avg_Open_To_Buy           10127 non-null  float64 
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64 
 16  Total_Trans_Amt           10127 non-null  int64   
 17  Total_Trans_Ct            10127 non-null  int64   
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64 
 19  Avg_Utilization_Ratio     10127 non-null  float64 
 20  Credit_Limit_Bin          10127 non-null  category
dtypes: category(1), float64(5), int64(9), object(6)
memory usage: 1.6+ MB

Observation on Total_Amt_Chng_Q4_Q1¶

In [38]:
# Observation on Total_Amt_Chng_Q4_Q1
histogram_boxplot(data, 'Total_Amt_Chng_Q4_Q1')
In [39]:
# Define Bin Edges
bin_edges = [0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
bin_labels = ['0-0.5', '0.5-1.0', '1.0-1.5', '1.5-2.0', '2.0-2.5', '2.5-3.0']

# Create a New Column for Bins
data['Total_Amt_Chng_Q4_Q1_Binned'] = pd.cut(data['Total_Amt_Chng_Q4_Q1'], bins=bin_edges, labels=bin_labels)

# Plot the Histogram with Bins
plt.figure(figsize=(12, 6))
sns.histplot(data=data, x='Total_Amt_Chng_Q4_Q1_Binned', kde=False)  # Removed kde for clarity
plt.title('Distribution of Change in Transaction Amount (Q4 over Q1) - Binned', fontsize=14)
plt.xlabel('Change in Transaction Amount (Binned)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45)
plt.show()

Insight

  • Many outliers present, especially on the upper right side.
  • Nearly normal distribution, slightly right skewed (excluding outliers)
  • The majority of customers fall between 0.5 and 1.0, indicating a relatively stable spending pattern

Observation on total transaction amount¶

In [40]:
# Observation on total transaction amount
histogram_boxplot(data, 'Total_Trans_Amt')

Insight

  • Many outliers present, especially on upper right side
  • 75% of customers have a total transaction amount of around \$4,500 or less
  • Four peaks in transaction amounts, with clusters of customers spending around \$1,500, \$4,000, \$8,000 and roughly \$15,000

Observation on Total_Trans_Ct¶

In [41]:
# Observation on Total_Trans_Ct
histogram_boxplot(data, 'Total_Trans_Ct')

Insight

  • Bimodal distribution with peaks around 40 and 70; right skewed
  • Few outliers present upper right side
  • 75% of customers have 80 or less total transaction count
  • Average is around 65

Observation on Total_Ct_Chng_Q4_Q1¶

In [42]:
# Observation on Total_Ct_Chng_Q4_Q1
histogram_boxplot(data, 'Total_Ct_Chng_Q4_Q1')

Insight

  • Normal distribution shape (excluding outliers)
  • Many outliers present on both sides, more on upper right side. Right skewed.
  • Average is 0.7
  • Majority of customers have a stable/predictable spending habits between quarters

Observation on Avg_Utilization_Ratio¶

In [43]:
#Observation on Avg_Utilization_Ratio
histogram_boxplot(data, 'Avg_Utilization_Ratio')

Insight

  • No outliers present; highly right skewed
  • The majority of customers have a utilization ratio of 0.5 or less
  • A large majority of customers use only a small percentage of their available credit
  • A very small number of customers have utilization ratios close to 1.0, indicating they are using almost all of their available credit

Categorical (Univariate Analysis)¶

In [44]:
# Pull up data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 22 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   Attrition_Flag               10127 non-null  object  
 1   Customer_Age                 10127 non-null  int64   
 2   Gender                       10127 non-null  object  
 3   Dependent_count              10127 non-null  int64   
 4   Education_Level              8608 non-null   object  
 5   Marital_Status               9378 non-null   object  
 6   Income_Category              10127 non-null  object  
 7   Card_Category                10127 non-null  object  
 8   Months_on_book               10127 non-null  int64   
 9   Total_Relationship_Count     10127 non-null  int64   
 10  Months_Inactive_12_mon       10127 non-null  int64   
 11  Contacts_Count_12_mon        10127 non-null  int64   
 12  Credit_Limit                 10127 non-null  float64 
 13  Total_Revolving_Bal          10127 non-null  int64   
 14  Avg_Open_To_Buy              10127 non-null  float64 
 15  Total_Amt_Chng_Q4_Q1         10127 non-null  float64 
 16  Total_Trans_Amt              10127 non-null  int64   
 17  Total_Trans_Ct               10127 non-null  int64   
 18  Total_Ct_Chng_Q4_Q1          10127 non-null  float64 
 19  Avg_Utilization_Ratio        10127 non-null  float64 
 20  Credit_Limit_Bin             10127 non-null  category
 21  Total_Amt_Chng_Q4_Q1_Binned  10120 non-null  category
dtypes: category(2), float64(5), int64(9), object(6)
memory usage: 1.6+ MB

Observation on Attrition Flag¶

In [45]:
# Countplot for attrition Flag
labeled_barplot(data, 'Attrition_Flag', perc = True)

Insight

  • Existing customers make up the majority at 83.9%
  • Attrited customers make up 16.1%

Further note

This will be the target variable; the class imbalance must be accounted for when building models.
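A quick class-balance check like the one below is worth keeping at hand; the toy series stands in for `data["Attrition_Flag"]`, whose actual split is roughly 84% existing / 16% attrited:

```python
import pandas as pd

# Class-balance check (toy series standing in for data["Attrition_Flag"])
flags = pd.Series(["Existing Customer"] * 84 + ["Attrited Customer"] * 16)
share = flags.value_counts(normalize=True)
print(share)
```

With a minority class this small, stratified splits and recall-oriented metrics (rather than plain accuracy) are the usual precautions.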

Observation on Gender¶

In [46]:
# Countplot for Gender
labeled_barplot(data, 'Gender', perc = True)

Insight

  • Female customers make up the majority at 52.9%; males account for 47.1%.
  • Customers are fairly balanced across genders

Observation on Education Level¶

In [47]:
# Count plot for education level
labeled_barplot(data, 'Education_Level', perc = True)
In [48]:
data['Education_Level'].value_counts().sum()
Out[48]:
8608

Insight

  • Graduate is the most frequent level at 30.9%, followed by High School at 19.9% and Uneducated at 14.7%.

  • Doctorate and Post-Graduate have the lowest percentages

  • Note: this column has missing values, which will be handled during model building

Observation for Marital Status¶

In [49]:
# Count plot for Marital status
labeled_barplot(data, 'Marital_Status', perc =True)
In [50]:
# Missing values count
data['Marital_Status'].value_counts().sum()
Out[50]:
9378

Insight

  • The majority are married (46.3%), followed by single (38.9%); divorced is the lowest at 7.4%
  • Missing data points will be handled when building the model

Observation on Income Category¶

In [51]:
# income category values
data['Income_Category'].unique()
Out[51]:
array(['$60K - $80K', 'Less than $40K', '$80K - $120K', '$40K - $60K',
       '$120K +', 'abc'], dtype=object)
In [52]:
#Change abc to unknown for income category
data['Income_Category'] = data['Income_Category'].replace('abc', 'Unknown')

It is unclear why "abc" appears as a category in Income_Category. We cannot assume it stands for $100K - $120K just because that range is missing. Changing it to "Unknown" preserves the rows; there is no reason to manipulate it further without confirmation.

In [53]:
# Income category
labeled_barplot(data, 'Income_Category', perc = True);

Insight

  • Most customers (35.2%) fall in the less-than-$40K income bracket
  • 75% of customers earn $80K or less
  • Unknown values make up 11%
  • Very few customers earn $120K+

Observation for Card Category¶

In [54]:
# Countplot on card category
labeled_barplot(data, 'Card_Category', perc = True)

Insight

  • The majority of customers hold the Blue card (93.2%)
  • Platinum is the rarest at 0.2%

Further analysis

  • What is the criteria for these categories? Would that data be useful in predicting customer churn?

Bivariate Analysis¶

For bivariate analysis in the Thera Bank customer dataset, we should analyze relationships between the target variable (Attrition_Flag) and various predictor variables to identify potential factors influencing customer churn.
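The count tables printed by `stacked_barplot_perc` (the notebook's own helper, defined earlier) appear to be margined crosstabs; a minimal sketch of that tabulation, with a toy frame standing in for `data`:

```python
import pandas as pd

# Margined crosstab of a category against the target
# (toy frame standing in for `data`)
toy = pd.DataFrame({
    "Gender": ["F", "F", "M"],
    "Attrition_Flag": ["Attrited Customer", "Existing Customer", "Existing Customer"],
})
counts = pd.crosstab(toy["Gender"], toy["Attrition_Flag"],
                     margins=True, margins_name="All")
print(counts)
```

Dividing each row by its "All" column (or passing `normalize="index"`) gives the per-category attrition rates used in the insights below.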

Factors to explore in relation to target¶

Demographic Factors:

  • Gender
  • Education_Level
  • Marital_Status
  • Income_Category
  • Card_Category

Relationship Factors:

  • Total_Relationship_Count (Number of products held by the customer)
  • Months_Inactive_12_mon
  • Contacts_Count_12_mon

Numerical Predictor Variables:

  • Customer Age: Customer_Age
  • Dependent Count: Dependent_count
  • Months on Book: Months_on_book

Credit Card Usage:

  • Credit_Limit
  • Total_Revolving_Bal
  • Avg_Open_To_Buy
  • Total_Amt_Chng_Q4_Q1
  • Total_Trans_Amt
  • Total_Trans_Ct
  • Total_Ct_Chng_Q4_Q1
  • Avg_Utilization_Ratio

Overview of Numeric columns¶

In [55]:
# Get only the numerical columns
numeric_columns = data.select_dtypes(include=['int64', 'float64'])

# Calculate the correlation matrix for the numerical columns
correlation_matrix = numeric_columns.corr()

# Create the heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(correlation_matrix, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.title('Correlation Heatmap for Numerical Columns')
plt.show()

Insight

  • Total_Trans_Amt and Total_Trans_Ct have a fairly strong positive correlation
  • Months_on_book and Customer_Age have a moderate positive correlation
  • Total_Revolving_Bal and Avg_Utilization_Ratio have a somewhat positive correlation
  • The strongest negative correlation is between Avg_Open_To_Buy and Avg_Utilization_Ratio

Demographics & Target¶

  • Gender
  • Education_Level
  • Marital_Status
  • Income_Category
  • Card_Category
In [56]:
# Gender & target
stacked_barplot_perc(data, 'Gender', 'Attrition_Flag')
Attrition_Flag  Attrited Customer  Existing Customer    All
Gender                                                     
All                          1627               8500  10127
F                             930               4428   5358
M                             697               4072   4769

Insight

  • Genders are split fairly evenly between existing and attrited customers.
  • Females have a slightly higher attrition rate (about 17.4%) than males (about 14.6%), given the totals for each gender
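The per-gender rates implied by the counts printed above can be computed directly:

```python
# Attrition rates from the gender counts shown above: (attrited, total)
counts = {"F": (930, 5358), "M": (697, 4769)}
rates = {g: attrited / total for g, (attrited, total) in counts.items()}
print({g: round(r, 3) for g, r in rates.items()})  # {'F': 0.174, 'M': 0.146}
```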
In [57]:
# Education level & target
stacked_barplot_perc(data, 'Education_Level','Attrition_Flag')
Attrition_Flag   Attrited Customer  Existing Customer   All
Education_Level                                            
All                           1371               7237  8608
Graduate                       487               2641  3128
High School                    306               1707  2013
Uneducated                     237               1250  1487
College                        154                859  1013
Doctorate                       95                356   451
Post-Graduate                   92                424   516

Insight

  • Attrition rates are fairly consistent across most education levels, at around 15-16%.

  • Doctorate (about 21%) and Post-Graduate (about 18%) customers actually show the highest attrition rates, although their counts are the smallest.

In [58]:
# marital status & target
stacked_barplot_perc(data, 'Marital_Status', 'Attrition_Flag')
Attrition_Flag  Attrited Customer  Existing Customer   All
Marital_Status                                            
All                          1498               7880  9378
Married                       709               3978  4687
Single                        668               3275  3943
Divorced                      121                627   748

Insight¶

  • No significant difference in attrition rates across marital statuses
  • Missing data could influence the totals

Observation on Income category & Target¶

In [59]:
# Income category and target
stacked_barplot_perc(data, 'Income_Category', 'Attrition_Flag')
Attrition_Flag   Attrited Customer  Existing Customer    All
Income_Category                                             
All                           1627               8500  10127
Less than $40K                 612               2949   3561
$40K - $60K                    271               1519   1790
$80K - $120K                   242               1293   1535
$60K - $80K                    189               1213   1402
Unknown                        187                925   1112
$120K +                        126                601    727

Insight

  • No significant difference in attrition rates across income categories
In [60]:
# Card category and target with
stacked_barplot_perc(data, 'Card_Category', 'Attrition_Flag')
Attrition_Flag  Attrited Customer  Existing Customer    All
Card_Category                                              
All                          1627               8500  10127
Blue                         1519               7917   9436
Silver                         82                473    555
Gold                           21                 95    116
Platinum                        5                 15     20

Insight

  • The majority of customers are in the Blue category
  • Platinum (5 of 20, 25%) and Gold (about 18%) show the highest attrition rates, though their counts are very small
  • Silver has the lowest attrition rate (about 15%)

Relationship & Target¶

  • Total_Relationship_Count (Number of products held by the customer)
  • Months_Inactive_12_mon
  • Contacts_Count_12_mon
In [61]:
# Total_Relationship_Count & target
stacked_barplot_perc(data, 'Total_Relationship_Count', 'Attrition_Flag')
Attrition_Flag            Attrited Customer  Existing Customer    All
Total_Relationship_Count                                             
All                                    1627               8500  10127
3                                       400               1905   2305
2                                       346                897   1243
1                                       233                677    910
5                                       227               1664   1891
4                                       225               1687   1912
6                                       196               1670   1866

Insight

  • Customers with 1-2 products have the highest attrition rates (roughly 26-28%)
  • Customers with 4-6 products have the lowest attrition rates (roughly 11-12%)
In [62]:
# Months_Inactive_12_mon & target
stacked_barplot_perc(data, 'Months_Inactive_12_mon', 'Attrition_Flag')
Attrition_Flag          Attrited Customer  Existing Customer    All
Months_Inactive_12_mon                                             
All                                  1627               8500  10127
3                                     826               3020   3846
2                                     505               2777   3282
4                                     130                305    435
1                                     100               2133   2233
5                                      32                146    178
6                                      19                105    124
0                                      15                 14     29
In [63]:
distribution_plot_wrt_target(data,'Months_Inactive_12_mon', 'Attrition_Flag')

Insight

  • Attrition generally rises with inactivity: the longer customers stay inactive, the more likely they are to close their accounts, from about 4.5% for those inactive 1 month to roughly 30% at 4 months.

  • The 0-month group is very small (29 customers) and should not be over-interpreted.

Observation on Contacts Count 12 month¶

In [64]:
# Contacts_Count_12_mon & target
stacked_barplot_perc(data, 'Contacts_Count_12_mon', 'Attrition_Flag')
Attrition_Flag         Attrited Customer  Existing Customer    All
Contacts_Count_12_mon                                             
All                                 1627               8500  10127
3                                    681               2699   3380
2                                    403               2824   3227
4                                    315               1077   1392
1                                    108               1391   1499
5                                     59                117    176
6                                     54                  0     54
0                                      7                392    399

Insight

  • Attrition rises steadily with contact count: the more often a customer contacted the bank in the past 12 months, the more likely they were to churn, from about 2% at 0 contacts to 34% at 5.

  • All 54 customers with 6 contacts churned (100% attrition), making this a crucial group to examine.
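The per-count rates behind these observations follow directly from the table above:

```python
# Attrition rate by Contacts_Count_12_mon, from the table above: (attrited, total)
table = {0: (7, 399), 1: (108, 1499), 2: (403, 3227), 3: (681, 3380),
         4: (315, 1392), 5: (59, 176), 6: (54, 54)}
rates = {k: round(a / t, 3) for k, (a, t) in table.items()}
print(rates)  # rises from 0.018 at 0 contacts to 1.0 at 6
```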


Numerical Predictors & Target¶

  • Customer Age: Customer_Age
  • Dependent Count: Dependent_count
  • Months on Book: Months_on_book

Observation on Customer Age & target¶

In [65]:
# Customer age & Target
distribution_plot_wrt_target(data, 'Customer_Age', 'Attrition_Flag')
In [66]:
# Customers Age & target
stacked_barplot_perc(data, 'Customer_Age', 'Attrition_Flag')
Attrition_Flag  Attrited Customer  Existing Customer    All
Customer_Age                                               
All                          1627               8500  10127
43                             85                388    473
48                             85                387    472
44                             84                416    500
46                             82                408    490
45                             79                407    486
49                             79                416    495
47                             76                403    479
41                             76                303    379
50                             71                381    452
54                             69                238    307
40                             64                297    361
42                             62                364    426
53                             59                328    387
52                             58                318    376
51                             58                340    398
55                             51                228    279
39                             48                285    333
38                             47                256    303
56                             43                219    262
59                             40                117    157
37                             37                223    260
57                             33                190    223
58                             24                133    157
36                             24                197    221
35                             21                163    184
33                             20                107    127
34                             19                127    146
32                             17                 89    106
61                             17                 76     93
62                             17                 76     93
30                             15                 55     70
31                             13                 78     91
60                             13                114    127
65                              9                 92    101
63                              8                 57     65
29                              7                 49     56
26                              6                 72     78
64                              5                 38     43
27                              3                 29     32
28                              1                 28     29
66                              1                  1      2
68                              1                  1      2
67                              0                  4      4
70                              0                  1      1
73                              0                  1      1

Insight

  • The 40s age range accounts for the highest number of attrited customers

  • Ages 66 and 68 show 50% attrition, but with only 2 customers each this is not meaningful

  • Customers aged 67, 70 and 73 have no closed accounts, though these groups are also very small

Observation on Dependent Count & Target¶

In [67]:
# Depdendent count & target
stacked_barplot_perc(data, 'Dependent_count', 'Attrition_Flag')
Attrition_Flag   Attrited Customer  Existing Customer    All
Dependent_count                                             
All                           1627               8500  10127
3                              482               2250   2732
2                              417               2238   2655
1                              269               1569   1838
4                              260               1314   1574
0                              135                769    904
5                               64                360    424

Insight

  • Customers with 3-4 dependents show slightly higher attrition rates (about 16.5-17.6%) than other groups
  • Overall, differences across dependent counts are small

Observation on Months on Book & target¶

In [68]:
# Months on Book & Target
distribution_plot_wrt_target(data, 'Months_on_book', 'Attrition_Flag')
In [69]:
# Months on book & Atrritubed
stacked_barplot_perc(data, 'Months_on_book', 'Attrition_Flag')
Attrition_Flag  Attrited Customer  Existing Customer    All
Months_on_book                                             
All                          1627               8500  10127
36                            430               2033   2463
39                             64                277    341
37                             62                296    358
30                             58                242    300
38                             57                290    347
34                             57                296    353
41                             51                246    297
33                             48                257    305
40                             45                288    333
35                             45                272    317
32                             44                245    289
28                             43                232    275
44                             42                188    230
43                             42                231    273
46                             36                161    197
42                             36                235    271
29                             34                207    241
31                             34                284    318
45                             33                194    227
25                             31                134    165
24                             28                132    160
48                             27                135    162
50                             25                 71     96
49                             24                117    141
26                             24                162    186
47                             24                147    171
27                             23                183    206
22                             20                 85    105
56                             17                 86    103
51                             16                 64     80
18                             13                 45     58
20                             13                 61     74
52                             12                 50     62
23                             12                104    116
21                             10                 73     83
15                              9                 25     34
53                              7                 71     78
13                              7                 63     70
19                              6                 57     63
54                              6                 47     53
17                              4                 35     39
55                              4                 38     42
16                              3                 26     29
14                              1                 15     16

Insight

  • The 36-month mark accounts for by far the most attrited customers, though this largely reflects that 36 months is also the most common tenure; the attrition rate there (about 17.5%) is only slightly above the overall 16.1%.

  • This could relate to a product expiring around 30-36 months, such as a housing, car or loan product.

Further analysis

  • Look at which products customers who closed their accounts around the 36-month mark were holding.
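Comparing the 36-month attrition rate against the overall rate, using the counts from the table above, makes the volume effect concrete:

```python
# Attrition at the 36-month mark vs. overall, from the counts above
rate_36 = 430 / 2463      # attrited / total at Months_on_book == 36
overall = 1627 / 10127    # attrited / total for the whole dataset
print(round(rate_36, 3), round(overall, 3))  # 0.175 0.161
```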

Credit Card Usage¶

  • Credit_Limit
  • Total_Revolving_Bal
  • Avg_Open_To_Buy
  • Total_Amt_Chng_Q4_Q1
  • Total_Trans_Amt
  • Total_Trans_Ct
  • Total_Ct_Chng_Q4_Q1
  • Avg_Utilization_Ratio

Observation on Credit Limit¶

In [70]:
# Credit Limit & target
distribution_plot_wrt_target(data, 'Credit_Limit', 'Attrition_Flag')

Insight

  • Credit limit distributions are similar for existing and attrited customers
  • Existing customers have more outliers
  • Medians are fairly similar for both groups

Observation on total revolving Balance & Target¶

In [71]:
# Total revolving balance & target
distribution_plot_wrt_target(data, 'Total_Revolving_Bal', 'Attrition_Flag')

Insight

  • For existing customers the distribution is roughly normal
  • For attrited customers the distribution is also roughly normal, but slightly more spread out
  • Most existing customers either spend less on their credit cards or pay off their balances in full each month
  • No outliers present

Observation on Avg_Open_To_Buy & target¶

In [72]:
# Avg_Open_To_Buy & target
distribution_plot_wrt_target(data, 'Avg_Open_To_Buy', 'Attrition_Flag')

Insight

  • Existing customers either spend a lot on their credit cards or have lower credit limits.
  • Attrited customers, on average, have higher average open-to-buy amounts than existing customers.
  • Both have many outliers

Observation on Total_Amt_Chng_Q4_Q1 & target¶

In [73]:
# Total_Amt_Chng_Q4_Q1 & target
distribution_plot_wrt_target(data, 'Total_Amt_Chng_Q4_Q1', 'Attrition_Flag')

Insight

  • The majority of existing customers show a change around 0.75, meaning that, on average, they did not change their spending significantly from Q4 to Q1
  • For attrited customers the distribution is roughly normal but with a larger spread, shifted slightly towards higher values, suggesting they were more likely to have increased their spending from Q4 to Q1
  • Many outliers present among existing customers

Total_Trans_Amt & target¶

In [74]:
# Total_Trans_Amt & target
distribution_plot_wrt_target(data, 'Total_Trans_Amt', 'Attrition_Flag')

Insight

  • The majority of existing customers have transaction amounts between \$2,000 and \$5,000, with a second peak in the distribution around \$10,000

  • Existing customers include more high-spending customers

  • Outliers present in both groups; right skewed

Total_Ct_Chng_Q4_Q1 & target¶

In [75]:
# Total_Ct_Chng_Q4_Q1 & target
distribution_plot_wrt_target(data, 'Total_Ct_Chng_Q4_Q1', 'Attrition_Flag')

Insight

  • Both have a roughly normal distribution
  • Most customers didn't change their transaction count significantly from Q4 to Q1.
  • Outliers present, more among existing customers, who continue to use their cards and therefore show more change

Avg_Utilization_Ratio & target¶

In [76]:
# Avg_Utilization_Ratio & target
distribution_plot_wrt_target(data, 'Avg_Utilization_Ratio', 'Attrition_Flag')

Insight

  • Most customers in both groups use a relatively small portion of their available credit

  • Many outliers present among attrited customers


Summary of EDA¶

  • How is the total transaction amount distributed?

    • The distribution of customer transaction amounts is heavily right-skewed, with a significant number of outliers on the higher end. Most customers (75%) have transaction amounts of $4,500 or less, with distinct spending clusters around $1,500, $4,000, $8,000 and roughly $15,000.
  • What is the distribution of the level of education of customers?

    • The analysis of customer education levels reveals that "Graduate" is the most common (30.9%), followed by "High School" (19.9%). The least represented groups are "Doctorate" (4.5%) and "Post-Graduate" (5.1%). "Uneducated" and "Unknown" entries may reflect a mix of inaccurate reporting and customers who preferred not to disclose their educational background.
  • What is the distribution of the level of income of customers?
    • The majority of customers earn less than $40K. The categories range from less than $40K to $120K+, and 75% of customers fall in the $0-80K range.
  • How does the change in transaction amount between Q4 and Q1 (total_ct_change_Q4_Q1) vary by the customer's account status (Attrition_Flag)?

    • For existing accounts the median change is around 0.7, with most customers between roughly 0.5 and 1.0; for closed accounts the median is slightly lower, indicating a drop in transaction count from Q4 to Q1.
  • How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?

    • Existing customers have a median of 2 months of inactivity, with most between 1 and 3 months; closed accounts cluster between 2 and 4 months of inactivity.
  • What are the attributes that have a strong correlation with each other?

    • Total_Trans_Amt and Total_Trans_Ct have a fairly strong positive correlation
    • Months_on_book and Customer_Age have a moderate positive correlation
    • Total_Revolving_Bal and Avg_Utilization_Ratio have a somewhat positive correlation
    • The strongest negative correlation is between Avg_Open_To_Buy and Avg_Utilization_Ratio
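Rather than eyeballing the heatmap, the strongest pairs can be ranked programmatically. A sketch with a toy frame standing in for `data`:

```python
import numpy as np
import pandas as pd

# Rank absolute pairwise correlations (toy frame standing in for `data`)
toy = pd.DataFrame({
    "Total_Trans_Amt": [100, 200, 300, 400],
    "Total_Trans_Ct":  [10, 21, 29, 41],
    "Customer_Age":    [40, 55, 38, 60],
})
corr = toy.corr()
# keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().abs().sort_values(ascending=False)
print(pairs.head())
```

Applied to the real frame, the top of this list would be the Total_Trans_Amt / Total_Trans_Ct pair noted above.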

Data Pre-processing¶

  • Convert Categorical Variables to Numerical: Machine learning algorithms generally require numerical input.
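A minimal sketch of that conversion with `pd.get_dummies`, using a toy frame standing in for the feature matrix:

```python
import pandas as pd

# One-hot encoding sketch (toy frame standing in for the feature set).
# drop_first avoids a redundant, perfectly collinear dummy column.
toy = pd.DataFrame({"Gender": ["F", "M", "F"], "Customer_Age": [45, 50, 38]})
encoded = pd.get_dummies(toy, columns=["Gender"], drop_first=True)
print(encoded.columns.tolist())  # ['Customer_Age', 'Gender_M']
```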
In [77]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()


plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Insight

  • There are quite a few outliers in the data
  • No treatment is needed, as they represent legitimate values rather than data errors
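The "quite a few outliers" observation can be quantified with the same 1.5×IQR rule used by the boxplot whiskers (`whis=1.5`); a minimal sketch on an illustrative skewed sample (the array below is made up, not drawn from the dataset):

```python
import numpy as np

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged,
# matching the whiskers drawn by plt.boxplot(..., whis=1.5)
def count_outliers(values):
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((values < lower) | (values > upper)))

# illustrative right-skewed sample, similar in shape to Credit_Limit
sample = np.array([1000, 1200, 1500, 2000, 2500, 3000, 3500, 4000, 20000, 35000])
print(count_outliers(sample))  # 2
```

Applying this per column would give an exact outlier count for each numeric feature.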
In [78]:
df1 = data.copy()
In [79]:
# Drop Total_Amt_Chng_Q4_Q1_Binned column (it is not needed)
df1 = df1.drop(['Total_Amt_Chng_Q4_Q1_Binned'], axis=1)
In [80]:
# Drop Credit_Limit_Bin (it is not needed)
df1 = df1.drop(['Credit_Limit_Bin'], axis=1)
In [81]:
X = df1.drop(["Attrition_Flag"], axis=1)
y = df1["Attrition_Flag"]
In [82]:
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)
In [83]:
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in validation data =", X_val.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 6075
Number of rows in validation data = 2026
Number of rows in test data = 2026

Split the Data into training, validation and test set¶

Reason why:

  • To build machine learning models that are more robust, generalizable, and ultimately, more useful for solving real-world problems.

  • Prevent overfitting

  • The model is evaluated on data it has never seen, which gives a realistic check of how well it generalizes.

  • Avoid data leakage
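The two-stage split above yields a 60/20/20 partition: 20% is first held out for test, then 25% of the remaining 80% (i.e. 20% of the total) becomes validation. A quick arithmetic check:

```python
import math

n = 10127  # total rows in the dataset

# train_test_split rounds the test fraction up (ceil) when test_size is a float
n_test = math.ceil(n * 0.20)            # first split: 20% held out for test
n_val = math.ceil((n - n_test) * 0.25)  # 25% of the remaining 80% = 20% of total
n_train = n - n_test - n_val

print(n_train, n_val, n_test)  # 6075 2026 2026, matching the shapes above
```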

Missing values¶

In [84]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Attrition_Flag            10127 non-null  object 
 1   Customer_Age              10127 non-null  int64  
 2   Gender                    10127 non-null  object 
 3   Dependent_count           10127 non-null  int64  
 4   Education_Level           8608 non-null   object 
 5   Marital_Status            9378 non-null   object 
 6   Income_Category           10127 non-null  object 
 7   Card_Category             10127 non-null  object 
 8   Months_on_book            10127 non-null  int64  
 9   Total_Relationship_Count  10127 non-null  int64  
 10  Months_Inactive_12_mon    10127 non-null  int64  
 11  Contacts_Count_12_mon     10127 non-null  int64  
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64  
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64  
 17  Total_Trans_Ct            10127 non-null  int64  
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(9), object(6)
memory usage: 1.5+ MB
In [85]:
# Define a dictionary to map class labels to 1 or 0
attrition_map = {'Attrited Customer': 1, 'Existing Customer': 0}

# Apply the mapping and convert to integer
y_train = y_train.map(attrition_map).astype(int)
y_val = y_val.map(attrition_map).astype(int)
y_test = y_test.map(attrition_map).astype(int)
In [86]:
# Define a dictionary to map 'M' to 1 and 'F' to 0
gender_map = {'M': 1, 'F': 0}
# Apply the mapping on the original frame; note that the split sets
# (X_train, X_val, X_test) still hold 'M'/'F' strings and are encoded
# later via get_dummies
data['Gender'] = data['Gender'].map(gender_map).astype(int)

Note

Originally, "Attrited Customer" and "Existing Customer" were left as string labels, but this caused issues with the pos_label argument in the model metrics, so they are mapped to numerical values here.
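For reference, sklearn's metrics can also work with string labels directly via `pos_label`; a minimal sketch (the four-row label lists are illustrative):

```python
from sklearn.metrics import recall_score

y_true = ["Attrited Customer", "Existing Customer", "Attrited Customer", "Existing Customer"]
y_pred = ["Attrited Customer", "Existing Customer", "Existing Customer", "Existing Customer"]

# pos_label tells sklearn which string label counts as the "positive" class
rec = recall_score(y_true, y_pred, pos_label="Attrited Customer")
print(rec)  # 1 of the 2 attrited customers was caught -> 0.5
```

Mapping the labels to 1/0 up front, as done above, simply makes the default `pos_label=1` work everywhere without repeating this argument.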

Missing Value Treatment¶

Reason for handling missing values after splitting the data

  • Data leakage prevention

  • Realistic evaluation

  • Consistent statistics

In [87]:
# Checking columns for missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age                  0
Gender                        0
Dependent_count               0
Education_Level             897
Marital_Status              432
Income_Category               0
Card_Category                 0
Months_on_book                0
Total_Relationship_Count      0
Months_Inactive_12_mon        0
Contacts_Count_12_mon         0
Credit_Limit                  0
Total_Revolving_Bal           0
Avg_Open_To_Buy               0
Total_Amt_Chng_Q4_Q1          0
Total_Trans_Amt               0
Total_Trans_Ct                0
Total_Ct_Chng_Q4_Q1           0
Avg_Utilization_Ratio         0
dtype: int64
------------------------------
Customer_Age                  0
Gender                        0
Dependent_count               0
Education_Level             308
Marital_Status              168
Income_Category               0
Card_Category                 0
Months_on_book                0
Total_Relationship_Count      0
Months_Inactive_12_mon        0
Contacts_Count_12_mon         0
Credit_Limit                  0
Total_Revolving_Bal           0
Avg_Open_To_Buy               0
Total_Amt_Chng_Q4_Q1          0
Total_Trans_Amt               0
Total_Trans_Ct                0
Total_Ct_Chng_Q4_Q1           0
Avg_Utilization_Ratio         0
dtype: int64
------------------------------
Customer_Age                  0
Gender                        0
Dependent_count               0
Education_Level             314
Marital_Status              149
Income_Category               0
Card_Category                 0
Months_on_book                0
Total_Relationship_Count      0
Months_Inactive_12_mon        0
Contacts_Count_12_mon         0
Credit_Limit                  0
Total_Revolving_Bal           0
Avg_Open_To_Buy               0
Total_Amt_Chng_Q4_Q1          0
Total_Trans_Amt               0
Total_Trans_Ct                0
Total_Ct_Chng_Q4_Q1           0
Avg_Utilization_Ratio         0
dtype: int64
In [88]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Get list of categorical and numerical columns
cat_cols = list(X_train.select_dtypes(include='object').columns)
num_cols = list(X_train.select_dtypes(include=['int', 'float']).columns)

# Impute categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
X_val[cat_cols] = cat_imputer.transform(X_val[cat_cols])
X_test[cat_cols] = cat_imputer.transform(X_test[cat_cols])

# Impute numerical columns
num_imputer = SimpleImputer(strategy='mean')
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_val[num_cols] = num_imputer.transform(X_val[num_cols])
X_test[num_cols] = num_imputer.transform(X_test[num_cols])
In [89]:
# Checking that no column has missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64

All missing values have been treated

Train Dataset¶

In [90]:
cols = X_train.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_train[i].value_counts())
    print("*" * 30)
Gender
F    3205
M    2870
Name: count, dtype: int64
******************************
Education_Level
Graduate         2763
High School      1212
Uneducated        928
College           598
Post-Graduate     299
Doctorate         275
Name: count, dtype: int64
******************************
Marital_Status
Married     3280
Single      2346
Divorced     449
Name: count, dtype: int64
******************************
Income_Category
Less than $40K    2130
$40K - $60K       1076
$80K - $120K       918
$60K - $80K        847
Unknown            668
$120K +            436
Name: count, dtype: int64
******************************
Card_Category
Blue        5668
Silver       327
Gold          71
Platinum       9
Name: count, dtype: int64
******************************

Validation Dataset¶

In [91]:
cols = X_val.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_val[i].value_counts())
    print("*" * 30)
Gender
F    1106
M     920
Name: count, dtype: int64
******************************
Education_Level
Graduate         935
High School      399
Uneducated       278
College          217
Post-Graduate    108
Doctorate         89
Name: count, dtype: int64
******************************
Marital_Status
Married     1057
Single       814
Divorced     155
Name: count, dtype: int64
******************************
Income_Category
Less than $40K    712
$40K - $60K       340
$80K - $120K      306
$60K - $80K       280
Unknown           248
$120K +           140
Name: count, dtype: int64
******************************
Card_Category
Blue        1893
Silver       108
Gold          19
Platinum       6
Name: count, dtype: int64
******************************

Test Dataset¶

In [92]:
cols = X_test.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_test[i].value_counts())
    print("*" * 30)
Gender
F    1047
M     979
Name: count, dtype: int64
******************************
Education_Level
Graduate         949
High School      402
Uneducated       281
College          198
Post-Graduate    109
Doctorate         87
Name: count, dtype: int64
******************************
Marital_Status
Married     1099
Single       783
Divorced     144
Name: count, dtype: int64
******************************
Income_Category
Less than $40K    719
$40K - $60K       374
$80K - $120K      311
$60K - $80K       275
Unknown           196
$120K +           151
Name: count, dtype: int64
******************************
Card_Category
Blue        1875
Silver       120
Gold          26
Platinum       5
Name: count, dtype: int64
******************************

Create Dummy Variables¶

In [93]:
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 30) (2026, 30) (2026, 30)

After encoding there are 30 columns.
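One caveat with calling `pd.get_dummies` on each split separately: if a rare category (e.g. `Platinum` cards, only 5-9 rows per split) were absent from one split, the column sets would diverge. A defensive sketch that re-aligns the test dummies to the training columns (the small frames below are illustrative):

```python
import pandas as pd

# Illustrative frames: the rare "Gold" category is absent from the test split
train = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Gold"]})
test = pd.DataFrame({"Card_Category": ["Blue", "Blue", "Silver"]})

train_d = pd.get_dummies(train, drop_first=True)
test_d = pd.get_dummies(test, drop_first=True)

# Reindex the test dummies to the training columns, filling missing ones with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns))  # ['Card_Category_Gold', 'Card_Category_Silver']
```

Here the splits happened to produce identical columns (30 each), so no re-alignment was needed, but the check is cheap insurance.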

Model Building¶

Functions for Models¶

In [94]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [95]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Model evaluation criterion¶

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are attrited customers correctly flagged by the model.
  • False negatives (FN) are attrited customers the model fails to flag.
  • False positives (FP) are existing customers the model wrongly flags as likely to attrite.

Potential types of losses

  • False Negatives (Loss of Revenue and Customer Lifetime Value): If the model fails to identify a customer who is likely to churn (a false negative), the bank loses out on the potential revenue that customer would have generated through fees, interest, and other financial products.
  • False Positives (Inefficient Resource Allocation): While less critical than false negatives, false positives (predicting a customer will churn when they actually won't) can still be costly.

Which loss is greater?

The loss associated with false negatives (losing customers) far outweighs the loss associated with false positives (unnecessary retention efforts).

Primary Metric: Recall (Sensitivity)

  • Definition: Recall is the ratio of true positives (correctly identified churners) to the total number of actual churners.
  • Formula: Recall = TP / (TP + FN)

  • Business Justification: The bank's priority is to identify as many potential customer churn cases as possible (minimize false negatives). Missing a customer who is likely to leave is costly due to lost revenue and the potential expense of reacquiring them. Recall directly measures the model's ability to capture these at-risk customers. Also, it is more cost effective to retain existing customers than acquiring new customers.

Additional:

  • F1 score. The F1 score is the harmonic mean of precision and recall. It's a good metric to consider when you want to balance both minimizing false negatives and false positives.

  • Precision-Recall Trade-off: While recall is crucial, banks also need to consider precision (the accuracy of positive predictions). A model with low precision might lead to unnecessary and costly retention efforts on customers who weren't actually going to churn. Therefore, finding the right balance between recall and precision is important.
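A toy confusion matrix makes the trade-off concrete (the counts below are illustrative, not from the model):

```python
# Illustrative counts: 100 actual churners, 900 actual stayers
tp, fn = 80, 20   # churners caught vs. missed
fp, tn = 90, 810  # stayers wrongly flagged vs. correctly left alone

recall = tp / (tp + fn)                 # share of actual churners caught
precision = tp / (tp + fp)              # share of churn flags that are real
f1 = 2 * precision * recall / (precision + recall)
print(round(recall, 3), round(precision, 3), round(f1, 3))  # 0.8 0.471 0.593
```

High recall (80% of churners caught) can coexist with modest precision (under half the retention offers go to real churners); the F1 score summarizes that tension in one number.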

Initial Model Building¶

Reason for building initial models

  • Baseline Performance
  • Feature Importance
  • Model Comparison
  • Iterative Improvement

Why these models?

  • Decision Tree (dtree), a good starting point due to its simplicity and interpretability.

  • Bagging (BaggingClassifier), since Decision Trees are prone to overfitting, Bagging can be a good way to improve their generalization performance.

  • Random Forest (RandomForestClassifier), often a top performer on a variety of tasks, including classification problems like customer churn.

  • GBM (GradientBoostingClassifier), given the potential complexity of the customer churn problem, GBM is a good candidate to see if it can outperform simpler models.

  • AdaBoost (AdaBoostClassifier), worth trying AdaBoost to see if it can provide a different performance profile compared to GBM.

Why class_weight balanced?

  • The class weights are adjusted to address the class imbalance in the dataset since we are concerned with the recall metric.
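For reference, sklearn's `class_weight='balanced'` assigns each class the weight `n_samples / (n_classes * count_c)`; using the training-split class counts (5099 existing, 976 attrited), the minority class receives roughly 5x the relative weight:

```python
# class_weight='balanced': weight_c = n_samples / (n_classes * count_c)
n_existing, n_attrited = 5099, 976  # class counts in the training split
n = n_existing + n_attrited

w_existing = n / (2 * n_existing)
w_attrited = n / (2 * n_attrited)
print(round(w_existing, 3), round(w_attrited, 3))  # 0.596 3.112
```

This upweights errors on attrited customers during training, which pushes the models toward higher recall on the minority class.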

Model Building with original data¶

In [96]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))

# Train and evaluate the models
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores_train = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores_train))

print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance:

Bagging: 0.9846311475409836
Random forest: 1.0
GBM: 0.9108606557377049
Adaboost: 0.8504098360655737
dtree: 1.0

Validation Performance:

Bagging: 0.7576687116564417
Random forest: 0.6625766871165644
GBM: 0.8190184049079755
Adaboost: 0.8128834355828221
dtree: 0.7300613496932515

Insight

  • Best Models: GBM and AdaBoost

  • Reasoning: These two achieved the highest recall on the validation set (0.8190 and 0.8129, respectively), indicating their ability to identify a good proportion of customers likely to churn.

Model Building with Oversampled data¶

Reason for this approach

  • Create new synthetic samples based on the existing minority class data. Given that the dataset exhibits a significant class imbalance between "Attrited Customer" (minority class) and "Existing Customer" (majority class), oversampling can be a valuable technique to improve the churn prediction model.
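Conceptually, SMOTE creates each synthetic point by interpolating between a minority sample and one of its k nearest minority neighbors: x_new = x_i + lam * (x_nn - x_i) with lam drawn from U(0, 1). A minimal sketch of that interpolation step (the arrays are illustrative; the real work is done by imblearn's `SMOTE` below):

```python
import numpy as np

rng = np.random.default_rng(1)

x_i = np.array([2.0, 10.0])   # a minority-class sample
x_nn = np.array([4.0, 14.0])  # one of its k nearest minority neighbors

# Synthetic point lies on the line segment between the two real samples
lam = rng.uniform(0, 1)
x_new = x_i + lam * (x_nn - x_i)
print(x_new)
```

Because synthetic points lie between real minority samples, SMOTE densifies the minority region rather than merely duplicating rows.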
In [97]:
# Print counts before oversampling
print("Before Oversampling, counts of label 'Attrited Customer': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'Existing Customer': {} \n".format(sum(y_train == 0)))

# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

# Print counts after oversampling
print("After Oversampling, counts of label 'Attrited Customer': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'Existing Customer': {} \n".format(sum(y_train_over == 0)))

print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Attrited Customer': 976
Before Oversampling, counts of label 'Existing Customer': 5099 

After Oversampling, counts of label 'Attrited Customer': 5099
After Oversampling, counts of label 'Existing Customer': 5099 

After Oversampling, the shape of train_X: (10198, 30)
After Oversampling, the shape of train_y: (10198,) 

In [98]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))

# Training Performance
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance:

Bagging: 0.9972543636007061
Random forest: 1.0
GBM: 0.9847028829182193
Adaboost: 0.9731319866640518
dtree: 1.0

Validation Performance:

Bagging: 0.8282208588957055
Random forest: 0.7791411042944786
GBM: 0.8680981595092024
Adaboost: 0.8466257668711656
dtree: 0.803680981595092

Insight

  • Best Models: GBM and AdaBoost

  • Reasoning: These models again showed the best recall on the validation set (0.8681 and 0.8466, respectively) after oversampling. This indicates they can handle balanced data effectively while maintaining good performance.

Model Building with Undersampled data¶

Reason for use?

The dataset is imbalanced with significantly more "Existing Customers" than "Attrited Customers". This is crucial because machine learning models tend to be biased towards the majority class in imbalanced datasets, leading to poor performance in identifying the minority class ("Attrited Customers").

In [99]:
# Print counts before undersampling
print("Before Undersampling, counts of label 'Attrited Customer': {}".format(sum(y_train == 1)))
print("Before Undersampling, counts of label 'Existing Customer': {} \n".format(sum(y_train == 0)))

# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)

# Print counts after undersampling
print("After Undersampling, counts of label 'Attrited Customer': {}".format(sum(y_train_un == 1)))
print("After Undersampling, counts of label 'Existing Customer': {} \n".format(sum(y_train_un == 0)))

print("After Undersampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Undersampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Undersampling, counts of label 'Attrited Customer': 976
Before Undersampling, counts of label 'Existing Customer': 5099 

After Undersampling, counts of label 'Attrited Customer': 976
After Undersampling, counts of label 'Existing Customer': 976 

After Undersampling, the shape of train_X: (1952, 30)
After Undersampling, the shape of train_y: (1952,) 

In [100]:
from sklearn.metrics import recall_score

models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))


print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance:

Bagging: 0.9907786885245902
Random forest: 1.0
GBM: 0.9784836065573771
Adaboost: 0.951844262295082
dtree: 1.0

Validation Performance:

Bagging: 0.8650306748466258
Random forest: 0.9079754601226994
GBM: 0.9294478527607362
Adaboost: 0.9294478527607362
dtree: 0.8650306748466258

Insight

  • Best Models: Random Forest, GBM, and AdaBoost

  • Reasoning: These three models achieved validation recall above 0.90, with Bagging close behind at 0.8650. While the near-perfect training recall suggests potential overfitting, the excellent validation recall indicates these models generalize well despite the class imbalance.

Model Overview Results¶

  • Oversampling vs. Undersampling: Both techniques improved validation recall compared to the original models, indicating they were effective in addressing the class imbalance.

  • Overall, the models trained on the undersampled data generally outperformed the oversampled models. The Bagging, Random Forest, GBM, and AdaBoost models trained on the undersampled data show the highest validation recall scores and less severe overfitting.

  • However, it's important to note that the near-perfect training recall scores for the undersampled models indicate some degree of overfitting, which needs further refinement.


Hyperparameter Tuning¶

Reason for use?

  • Hyperparameter tuning can help you find the optimal configuration for each model (Bagging, Random Forest, GBM, AdaBoost, Decision Tree) to improve its ability to predict churn accurately.

  • It can help reduce overfitting, which is evident in the current results: the models perform much better on the training data than on the validation data.

  • Continue tuning the best-performing models using the same data (undersampled or oversampled) on which they were trained before

Tuning AdaBoostClassifier model with Undersampled data¶

Reason for use?

AdaBoost is known for its sensitivity to class imbalances. By providing it with a balanced dataset through undersampling, we can enhance its ability to learn patterns from both the "Attrited" and "Existing" classes more effectively.

In [101]:
%%time
import sklearn.metrics as metrics
# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": np.arange(10, 40, 10),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 30, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9487650444793303:
CPU times: user 1.55 s, sys: 146 ms, total: 1.69 s
Wall time: 30.9 s

Insight into model's best parameters

  • n_estimators: 30 - This means the model uses 30 decision trees as base learners.

  • learning_rate: 0.1 - This controls how much each tree contributes to the overall model. A lower learning rate means each tree has a smaller impact, which can help prevent overfitting but might require more trees to achieve good performance.

  • base_estimator: DecisionTreeClassifier(max_depth=3, random_state=1) - This specifies that the base learners are decision trees with a maximum depth of 3. A smaller depth helps control the complexity of each tree, again reducing the risk of overfitting.

In [102]:
tuned_adb = AdaBoostClassifier(
    random_state=1,
    n_estimators=30,
    learning_rate=0.1,
    base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_adb.fit(X_train_un, y_train_un)
Out[102]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=30, random_state=1)
In [103]:
# Checking model's performance on training set
adb_train = model_performance_classification_sklearn(tuned_adb, X_train_un, y_train_un)
adb_train
Out[103]:
   Accuracy  Recall  Precision     F1
0     0.953   0.964      0.943  0.953

Insight

  • The model performs well on the training set, achieving high scores across all metrics.

  • This means the model can accurately identify churned customers within the training set.

In [104]:
# Checking model's performance on validation set
adb_val = model_performance_classification_sklearn(tuned_adb, X_val, y_val)
adb_val
Out[104]:
   Accuracy  Recall  Precision     F1
0     0.927   0.933      0.709  0.805

Insight

  • The model's performance drops on the validation set, which suggests overfitting: the model has learned the training data too well.
  • Recall remained high at 93.3%.
  • Some noise was captured during training, so the model does not generalize fully to unseen data.

Tuning AdaBoostClassifier model with Undersampled data conclusion¶

  • The main issues with the model were overfitting and the drop in precision and F1 on the validation set, i.e., weaker generalization. This is crucial for a churn prediction model.

Address overfitting

  • Simplify the model
  • Collect more data
  • Hyperparameter tuning
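To make the "simplify the model" option concrete: capping the capacity of tree-based learners (e.g. via `max_depth`) directly limits how much training noise they can memorize. A toy sketch on synthetic data (all numbers here are illustrative, not from the bank dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data: the label depends on the first feature plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

deep = DecisionTreeClassifier(random_state=1).fit(X, y)                  # unconstrained
shallow = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)  # capped capacity

# The unconstrained tree memorizes the training noise; the shallow one cannot
print(deep.score(X, y), shallow.score(X, y))
```

The unconstrained tree reaches perfect training accuracy by fitting the noise, while the depth-limited tree trades some training accuracy for a better chance of generalizing, which is exactly the lever the hyperparameter search exploits via `max_depth` on the base estimators.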

Tuning Gradient Boosting model with Undersampled Data¶

Reason for use?

  • Gradient Boosting is a powerful machine learning technique used for classification tasks.

  • Undersampling is a technique for dealing with class imbalance in datasets, where one class has significantly more samples than the others. In the context of customer churn, existing Customers usually far outnumber Attrited Customers

In [105]:
%%time

#Creating pipeline
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(125,175,25),
    "learning_rate": [0.01, 0.2, 0.05, 1],
    "subsample":[0.8,0.9,1],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 150, 'max_features': 0.5, 'learning_rate': 0.05, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9538932496075354:
CPU times: user 2.96 s, sys: 251 ms, total: 3.21 s
Wall time: 2min 8s

Insight in model's best parameters

  • subsample: 0.9 - Each tree is fit on 90% of the rows, introducing randomness that helps prevent overfitting.
  • n_estimators: 150 - The model uses 150 decision trees as weak learners.
  • max_features: 0.5 - Each split considers half of the features, further reducing overfitting and improving robustness.
  • learning_rate: 0.05 - A lower learning rate means slower learning but potentially better generalization.
In [106]:
tuned_gbm1 = GradientBoostingClassifier(
    random_state=1,
    subsample=0.9,
    n_estimators=150,
    max_features=0.5,
    learning_rate=0.05,
    init=AdaBoostClassifier(random_state=1),
)
tuned_gbm1.fit(X_train_un, y_train_un)
Out[106]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           learning_rate=0.05, max_features=0.5,
                           n_estimators=150, random_state=1, subsample=0.9)
In [107]:
# Checking model's performance on training set
gbm1_train = model_performance_classification_sklearn(
    tuned_gbm1, X_train_un, y_train_un
)
gbm1_train
Out[107]:
Accuracy Recall Precision F1
0 0.966 0.975 0.958 0.966

Insight

  • Performing well on the training set
  • Recall: 97.5% of the actual attrited customers were correctly identified.
In [108]:
# Checking model's performance on validation set
gbm1_val = model_performance_classification_sklearn(tuned_gbm1, X_val, y_val)
gbm1_val
Out[108]:
Accuracy Recall Precision F1
0 0.942 0.914 0.770 0.836

Insight

  • Model performing well, but still overfitting
  • Recall: 91.4% of the actual attrited customers in the validation set were correctly identified.
  • Precision and F1 score dropped the most between training and validation

Tuning Gradient Boosting model with Undersampled Data Conclusion¶

  • High Recall is a Priority: In the context of churn prediction, recall is often prioritized because the cost of missing a potential churner (false negative) is higher than incorrectly identifying someone as likely to churn (false positive). The model's high recall on both sets is a positive sign, as it suggests the model is effectively capturing most of the customers who will actually churn.
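The reason recall is the scorer passed to RandomizedSearchCV can be shown with a tiny worked example (toy labels, not the model's predictions): every missed churner lowers recall, while a false alarm lowers only precision.

```python
# Recall = TP / (TP + FN): of all actual churners, how many did we catch?
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = Attrited, 0 = Existing
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # one missed churner, one false alarm

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

recall = tp / (tp + fn)       # 3 / 4 = 0.75
precision = tp / (tp + fp)    # 3 / 4 = 0.75
print(recall, precision)
```

In production, `metrics.recall_score(y_true, y_pred)` computes the same quantity directly.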

Tuning Gradient Boosting model with Oversampled data¶

Reason for use?

  • Potentially build a more accurate, effective, and business-oriented churn prediction model that prioritizes the identification of at-risk customers.

  • Recall improvements as the model can use oversampling to boost the model's recall.

  • Enhanced learning of the model
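The oversampled set (`X_train_over`, `y_train_over`) was likewise prepared earlier in the notebook, commonly via imbalanced-learn's SMOTE or `RandomOverSampler`. The simpler of the two, random oversampling with replacement, can be sketched on toy rows (the ratio is illustrative only):

```python
import random

random.seed(1)

# Toy imbalanced sample: (label, feature) pairs with an illustrative 90/10 split
data = [("Existing", x) for x in range(90)] + [("Attrited", x) for x in range(10)]
majority = [row for row in data if row[0] == "Existing"]
minority = [row for row in data if row[0] == "Attrited"]

# Random oversampling: resample the minority class WITH replacement until it
# matches the majority size (SMOTE instead synthesizes new points by
# interpolating between minority-class neighbours)
oversampled_minority = [random.choice(minority) for _ in range(len(majority))]
balanced = majority + oversampled_minority
print(len(balanced), sum(1 for y, _ in balanced if y == "Attrited"))  # 180 90
```

Unlike undersampling, no majority-class information is discarded, at the cost of duplicated (or synthetic) minority rows that the model may partially memorize.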

In [109]:
%%time

#defining model
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(75,150,25),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "subsample":[0.5,0.7,1],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.5, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.01, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9541109122746253:
CPU times: user 6.86 s, sys: 659 ms, total: 7.52 s
Wall time: 6min 44s

Insight in model's best parameters

  • subsample: 0.5 - This introduces randomness and helps prevent overfitting by reducing the variance of the model.

  • n_estimators: 100 - Increasing the number of estimators can improve model performance.

  • max_features: 0.7 - This randomness helps to reduce correlation between the trees, further preventing overfitting and improving generalization.

  • learning_rate: 0.01 - This is a small learning rate, which means each tree contributes a small amount to the overall model's predictions.

In [110]:
tuned_gbm2 = GradientBoostingClassifier(
    random_state=1,
    subsample=0.5,
    n_estimators=100,
    max_features=0.7,
    learning_rate=0.01,
    init=AdaBoostClassifier(random_state=1),
)
tuned_gbm2.fit(X_train_over, y_train_over)
Out[110]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           learning_rate=0.01, max_features=0.7, random_state=1,
                           subsample=0.5)
In [111]:
# Checking model's performance on training set
gbm2_train = model_performance_classification_sklearn(tuned_gbm2, X_train_over, y_train_over)
gbm2_train
Out[111]:
Accuracy Recall Precision F1
0 0.934 0.963 0.909 0.935

Insight

  • Model performs well on the training set
  • Recall: 96.3% of the actual attrited customers were correctly identified.
In [112]:
# Checking model's performance on validation set
gbm2_val = model_performance_classification_sklearn(tuned_gbm2, X_val, y_val)
gbm2_val
Out[112]:
Accuracy Recall Precision F1
0 0.914 0.899 0.674 0.770

Insight

  • Model performing well, but the train/validation gap suggests overfitting
  • Recall: 89.9% of the actual attrited customers in the validation set were correctly identified.
  • Precision and F1 Score dropped the most between training and validation

Tuning Gradient Boosting model with Oversampled Data Conclusion¶

  • Strong performance, but some overfitting: The model performs very well on both the training and validation sets, demonstrating high accuracy and recall.

Model Comparison and Final Model Selection¶

Reason for comparison?

  • Different machine learning algorithms have different strengths and weaknesses.

  • Optimizing for business goals

  • Avoiding overfitting

  • Robustness and reliability

In [113]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        gbm1_train.T,
        gbm2_train.T,
        adb_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Gradient boosting trained with Undersampled data",
    "Gradient boosting trained with Oversampled data",
    "AdaBoost trained with Undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[113]:
Gradient boosting trained with Undersampled data Gradient boosting trained with Oversampled data AdaBoost trained with Undersampled data
Accuracy 0.966 0.934 0.953
Recall 0.975 0.963 0.964
Precision 0.958 0.909 0.943
F1 0.966 0.935 0.953
In [114]:
# Validation performance comparison

models_train_comp_df = pd.concat(
    [ gbm1_val.T, gbm2_val.T, adb_val.T], axis=1,
)
models_train_comp_df.columns = [
    "Gradient boosting trained with Undersampled data",
    "Gradient boosting trained with Oversampled data",
    "AdaBoost trained with Undersampled data",
]
print("Validation performance comparison:")
models_train_comp_df
Validation performance comparison:
Out[114]:
Gradient boosting trained with Undersampled data Gradient boosting trained with Oversampled data AdaBoost trained with Undersampled data
Accuracy 0.942 0.914 0.927
Recall 0.914 0.899 0.933
Precision 0.770 0.674 0.709
F1 0.836 0.770 0.805

Insight

  • AdaBoost trained with Undersampled data has the best recall score of 0.933, meaning it correctly identifies the highest percentage of customers who are likely to churn. However, it's important to consider the trade-off with precision, which is slightly lower for AdaBoost compared to the Gradient Boosting model trained with undersampled data.
  • If the priority is to maximize the identification of potential churners (even at the cost of some false positives), then AdaBoost might be the preferred choice.

  • If a balance between identifying churners and minimizing incorrect predictions is desired, then the Gradient Boosting model with undersampled data could be a better option due to its higher F1 score.

Final thought

  • Use AdaBoost with undersampled model to best predict if a customer will churn.
In [115]:
# Checking model's performance on test set
tuned_adb_test = model_performance_classification_sklearn(tuned_adb, X_test, y_test)
tuned_adb_test
Out[115]:
Accuracy Recall Precision F1
0 0.928 0.938 0.708 0.807

Insight

  • Recall: 93.8% of the actual "Attrited Customers" in the test set were correctly identified.

  • F1 Score: The F1 score of 0.807 provides a balanced measure of the model's precision and recall.

Overall, the final model exhibits strong performance, particularly its high recall rate, which allows Thera Bank to proactively identify and target customers who are most likely to churn. However, the lower precision indicates that the model may be flagging some customers as potential churn risks who might not actually leave.

Further actions

  • Analyze false positives
  • Fine-tune hyperparameters to balance precision and recall
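One way to act on the second point without a full re-tune is to move the classification threshold applied to the model's predicted probabilities (sklearn's `predict_proba`). A sketch on hypothetical probabilities illustrates the recall/precision trade-off; none of these numbers come from the actual model:

```python
# Hypothetical predicted churn probabilities and true labels (1 = Attrited)
probs  = [0.95, 0.80, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
y_true = [1,    1,    1,    0,    1,    0,    0,    0]

def scores_at(threshold):
    """Recall and precision when flagging customers with prob >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Lowering the threshold flags more customers: recall rises, precision falls
print(scores_at(0.5))   # (0.75, 1.0)
print(scores_at(0.25))  # (1.0, 0.8)
```

Sweeping the threshold over a validation set (e.g. with `metrics.precision_recall_curve`) lets the bank pick the operating point that matches its retention-campaign budget.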
In [116]:
feature_names = X_train.columns
importances = tuned_adb.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Insight

The top features driving churn predictions are:

  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Revolving_Bal: Total Revolving Balance on the Credit Card

This suggests that a customer's overall transaction behavior, including the number and amount of transactions and the outstanding balance, plays a crucial role in predicting their likelihood to churn.

Least important features

Surprisingly, the demographic and categorical features showed the least importance, indicating that these individual categories don't have a strong direct impact on churn prediction:

  • Age
  • Gender
  • Dependent count
  • Education level
  • Marital status

It suggests that the bank should focus on understanding and managing customers' transaction behavior, relationship depth, and recent changes in spending patterns to effectively reduce churn.


Business Insights and Conclusions¶

Based on the final model analysis, we can derive the following business insights and recommendations for Thera Bank.

  • Transaction behavior is critical

    • The most important features for predicting customer churn are related to transaction behavior: total transaction count, total transaction amount, and total revolving balance. This means customers who are not actively using their credit cards or who are carrying high balances are more likely to churn.
  • Changes in spending matter

    • The total amount and count changes from Q4 to Q1 also play a significant role. Customers who drastically reduce their spending or transaction frequency may be signaling dissatisfaction or changing financial situations, increasing their risk of churn.
  • Relationships matter

    • The total number of products held by the customer has moderate importance in predicting churn. This suggests that customers with deeper relationships with the bank (holding multiple products) are more likely to stay.
  • Demographics less influential

    • Surprisingly, demographic factors like age, gender, and education level, along with income category and card category, have relatively low importance. This implies that churn behavior is less influenced by these factors and more by actual usage and financial patterns.

Recommendations¶

Proactive engagement

  • Thera Bank should focus on proactively engaging customers who exhibit low transaction activity or high revolving balances. This could involve targeted marketing campaigns, personalized offers, or financial education resources to encourage responsible credit card usage.

Monitoring spending patterns

  • The bank should closely monitor changes in customers' spending and transaction patterns. Sudden drops in activity could be early warning signs of potential churn. Reaching out to these customers with personalized communication or tailored offers could help prevent them from leaving.

Rewarding loyalty

  • To strengthen customer relationships, the bank could develop loyalty programs or rewards systems that incentivize customers to use their credit cards more frequently and maintain healthy balances.

Cross-selling opportunities

  • Since the number of products held by the customer is a significant factor, the bank should explore cross-selling opportunities to deepen relationships and increase customer engagement.

Additional Comments¶

Data collection and analysis

  • While demographics seem less important, the bank should continue collecting this data and analyzing it in conjunction with other factors to uncover any potential hidden patterns or interactions.

Model refinement

  • The current model performs well but could be further refined. The bank should continue to monitor the model's performance, experiment with different algorithms and hyperparameters, and incorporate new data as it becomes available to ensure the model remains accurate and effective over time.

Revisiting initial Hypothesis¶

The analysis reveals that transaction behavior and engagement metrics (like total transaction count and changes in spending) are more significant predictors of churn than demographic factors alone.

Hypotheses that were partially supported:

  • Hypothesis 4 Customers with longer relationships with the bank are less likely to churn. This might be partially true, as "Months_on_book" wasn't a top predictor, but it could still interact with other factors.

  • Hypothesis 5 Customers with more products at the bank are more likely to stay due to higher engagement. This has some support, as "Total_Relationship_Count" was identified as a moderately important feature.

  • Hypothesis 6 Frequent contact with the bank (e.g., customer service) might reduce churn. This wasn't strongly supported in the feature importance analysis.



Additional EDA Analysis¶

In [117]:
#Pull up info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Attrition_Flag            10127 non-null  object 
 1   Customer_Age              10127 non-null  int64  
 2   Gender                    10127 non-null  object 
 3   Dependent_count           10127 non-null  int64  
 4   Education_Level           8608 non-null   object 
 5   Marital_Status            9378 non-null   object 
 6   Income_Category           10127 non-null  object 
 7   Card_Category             10127 non-null  object 
 8   Months_on_book            10127 non-null  int64  
 9   Total_Relationship_Count  10127 non-null  int64  
 10  Months_Inactive_12_mon    10127 non-null  int64  
 11  Contacts_Count_12_mon     10127 non-null  int64  
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64  
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64  
 17  Total_Trans_Ct            10127 non-null  int64  
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(9), object(6)
memory usage: 1.5+ MB

Bivariate Analysis¶

In [118]:
sns.pairplot(data=df, diag_kind="kde")
plt.show()

Insight

  • Credit limit and Avg Open To Buy have a positive correlation: as a customer's credit limit increases, their average open-to-buy amount also tends to increase.
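That relationship is nearly mechanical, since open-to-buy is roughly the credit limit minus the outstanding balance. A minimal Pearson-correlation sketch on made-up values (in the notebook itself, `df[['Credit_Limit', 'Avg_Open_To_Buy']].corr()` would give the real figure):

```python
# Toy columns (values are made up): open-to-buy = limit - revolving balance
credit_limit = [2000.0, 5000.0, 8000.0, 12000.0, 20000.0]
revolving    = [ 500.0, 1200.0,  800.0,  1500.0,  1000.0]
open_to_buy  = [c - r for c, r in zip(credit_limit, revolving)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Close to 1.0 because the revolving balances are small relative to the limits
print(round(pearson(credit_limit, open_to_buy), 3))
```

Such near-redundant feature pairs are worth noting because they can dilute feature-importance scores between the two columns.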

Observation on Customer Age & Education¶

In [119]:
# Boxplot age & Education
plt.figure(figsize=(12, 6))
sns.boxplot(x='Education_Level', y='Customer_Age', data=df)
plt.xticks(rotation=45)
plt.show()

Insight

  • Median age is similar across education levels.
  • Outliers present in the High School category

Observation on Customer age & Card Category¶

In [120]:
# Boxplot age & Card category
plt.figure(figsize=(12, 6))
sns.boxplot(x='Customer_Age', y='Card_Category', data=df)
plt.xticks(rotation=45)
plt.show()

Insight

  • Median age of about 45 across all card categories
  • Few outliers present for Blue, Gold, and Silver
  • Platinum has the narrowest customer age range.

Observation on Total_Trans_Amt & Education¶

In [121]:
# Total_Trans_Amt & Education
plt.figure(figsize=(12, 6))
sns.boxplot(x='Education_Level', y='Total_Trans_Amt', data=df)
plt.xticks(rotation=45)
plt.show()

Insight

  • Spending among education levels is similar.
  • Outliers present at all levels

Marital Status & Total_Trans_Amt¶

In [122]:
# Marital Status & Total_Trans_Amt
plt.figure(figsize=(12, 6))
sns.boxplot(x='Marital_Status', y='Total_Trans_Amt', data=df)
plt.xticks(rotation=45)
plt.show()

Insight

  • Total transaction amount does not vary much between marital status
  • Outliers are present in all
In [123]:
# Age & Total_Ct_Chng_Q4_Q1
plt.figure(figsize=(12, 6))
sns.boxplot(x='Customer_Age', y='Total_Ct_Chng_Q4_Q1', data=df)
plt.xticks(rotation=45)
plt.show()

Insight

  • Ages range from 26 to 73
  • Most ages from 29 to 61 have a similar median total count change from Q4 to Q1
  • The largest changes are for ages 66 to 68
  • More outliers in the 40s age range
In [124]:
# Age & Total_Trans_Amt
plt.figure(figsize=(12, 6))
sns.boxplot(x='Customer_Age', y='Total_Trans_Amt', data=df)
plt.xticks(rotation=45)
plt.show()

Insight

  • Ages 29-32 have a higher total transaction amount
  • Ages 40-52 have a similar median total transaction amount
  • Outliers are most present for ages 39-61


In [125]:
# Convert ipynb to HTML
In [126]:
!jupyter nbconvert --to html /content/AML_Project_LearnerNotebook_FullCode_Final.ipynb
[NbConvertApp] WARNING | pattern '/content/AML_Project_LearnerNotebook_FullCode_Final.ipynb' matched no files
In [ ]: